Time-series analysis with smoothed Convolutional Neural Network

CNN originates from image processing and is not commonly known as a forecasting technique in time-series analysis, which depends on the quality of input data. One way to improve the quality is to smooth the data. This study introduces a novel hybrid of exponential smoothing and CNN called Smoothed-CNN (S-CNN). Combined methods outperform the majority of individual solutions in forecasting. The S-CNN was compared with the original CNN method and other forecasting methods such as Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM). The dataset is a one-year time-series of daily website visitors. Since there are no special rules for choosing the number of hidden layers, the Lucas number was used. The results show that S-CNN is better than MLP and LSTM, with the best MSE of 0.012147693 using 76 hidden layers at 80%:20% data composition.


Smoothing algorithm for time-series
Data smoothing can enhance the quality of data. Smoothing generates excellent results in small dataset forecasting by removing outliers from time-series data [18]. This method is easy to understand and can be effectively implemented in new research without referring to or taking parameters from other studies [19].
Smoothing procedures improve forecasting by averaging the past values of time-series data [20]. The algorithm assigns weights to previous observations to predict future values [21], smooths the fluctuations in the data, and eliminates noise [22]. Common types of data smoothing are simple exponential smoothing (SES), also called exponential smoothing (ES), moving average (MA), and random walk (RW). In a forecasting task, data smoothing can help researchers predict trends. Table 1 describes the types of data smoothing with their advantages and disadvantages.
In this study, we use exponential smoothing as the data smoothing method. Simple Exponential Smoothing (SES) [23], also known as Exponential Smoothing (ES) [18], is included in the forecasting libraries that Hyndman developed for the R software. Similar to other methods, ES works well for short-term forecasts that take seasonality into account, and the models chosen were evaluated only using MAPE. This method is currently used in forecasting tasks due to its performance. Table 2 presents several related works that employ ES for forecasting. Exponential smoothing is a rule-of-thumb approach for smoothing time-series data using the exponential window function, which applies exponentially decreasing weights to older observations. It is simple to learn and to use for making a judgment based on the user's prior assumptions, such as seasonality [24].

Experimental design
To conduct the research systematically, we designed the experiment as shown in Fig. 1. We compared the smoothed CNN with the basic CNN using various datasets. We also used various scenarios and metrics to determine the best scenario. The details of Fig. 1 are explained in the following subsections.

Dataset
We used four different datasets in this study. Table 3 shows the characteristics of each dataset. Dataset 1 is the primary dataset, while the rest are used to test the proposed method's consistency. All of the datasets were multivariate. However, we selected a single attribute (univariate) from each dataset due to the limitation of the study.
The first time-series data in this study is from a journal website of Universitas Negeri Malang (Knowledge Engineering Data Science/KEDS). We retrieved the .csv file from the Statcounter machine connected to the website. The dataset contains data within the period of January 2018 to 31 December 2018 [30]. In this study, the input and output of the forecasting algorithms are the sessions attribute. Sessions (unique visitors) are the number of visitors from one IP in a certain period [31]. The number of unique visitors is an essential success indicator of an electronic journal. It measures the breadth of distribution that will accelerate the journal accreditation system [32]. The second dataset is similar to the first except for the source and the range of the data. The third and fourth datasets are foreign exchange and electrical energy consumption.
We used two scenarios to discover the influence of training data composition on forecasting performance. The first scenario used 70% of the data (256 days) for training and 30% (109 days) for testing. The second scenario used 80% of the dataset (292 days) for training and the rest (73 days) for testing. Figure 2 illustrates the scheme of the training and testing data composition.
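The two data compositions can be reproduced with a short sketch. It assumes a 365-day series, a chronological split (the first part for training, the remainder for testing), and rounding the training size up; the rounding rule is inferred from the reported 256/109 split and is not stated in the text. The function names are illustrative.

```python
def split_counts(n_days, train_percent):
    """Return (train_size, test_size) for a given training percentage."""
    # Integer ceiling division; rounding up is an assumption inferred from
    # the reported 256 training days out of 365 at 70%.
    train_size = (n_days * train_percent + 99) // 100
    return train_size, n_days - train_size


def chronological_split(series, train_percent):
    """Split a series in time order (assumed): earliest part trains, rest tests."""
    train_size, _ = split_counts(len(series), train_percent)
    return series[:train_size], series[train_size:]


if __name__ == "__main__":
    for percent in (70, 80):
        train, test = split_counts(365, percent)
        print(f"{percent}% training: {train} days train, {test} days test")
```

For 365 days, this yields 256/109 days in the first scenario and 292/73 days in the second, matching the counts reported above.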

Data normalization
The natural behavior of most time-series is dynamic and nonlinear [33]. Data normalization is used to deal with this problem. Because the main objective of data normalization is to ensure the quality of the data before it is fed to any model, it substantially influences the performance of any model [34]. Data normalization is essential for CNN [35] because it can scale the attribute into a specific range required by the activation function. This study uses Min-Max normalization. The method assures that all features have the same scale, although it is inefficient in dealing with outliers. Equation (1) shows the Min-Max formula [36], resulting in normalized data with smaller intervals within 0-1.
$$X_{t(norm)} = \frac{X_t - X_{min}}{X_{max} - X_{min}} \tag{1}$$
where $X_{t(norm)}$ is the result of normalization, $X_t$ is the data to be normalized, and $X_{min}$ and $X_{max}$ stand for the minimum and maximum value of the entire data.
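As an illustration of the Min-Max normalization described above, the following minimal sketch rescales a series into the range 0-1 using the standard form (x - min) / (max - min). The function names and the inverse helper are illustrative assumptions, not part of the study's code.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        raise ValueError("all values are equal; Min-Max scaling is undefined")
    return [(x - x_min) / span for x in values]


def min_max_denormalize(normalized, x_min, x_max):
    """Map normalized values back to the original scale (hypothetical helper)."""
    return [v * (x_max - x_min) + x_min for v in normalized]


if __name__ == "__main__":
    sessions = [10, 20, 30]  # toy values, not taken from the dataset
    scaled = min_max_normalize(sessions)
    print(scaled)
    print(min_max_denormalize(scaled, min(sessions), max(sessions)))
```

Because the minimum and maximum come from the data itself, a single extreme value compresses all other points, which is the outlier sensitivity noted above.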

Exponential smoothing with optimum α
In time-series forecasting, the raw data series is generally denoted as $\{X_t\}$, with starting time at $t = 0$. Here $t$ is a day index. The result of the exponential smoothing process is commonly written as $S_t$, which is considered as the potential future value of $X$. Equations (2) and (3) give the single exponential smoothing [37]:
$$S_0 = X_0, \quad t = 0 \tag{2}$$
$$S_t = \alpha X_t + (1 - \alpha) S_{t-1}, \quad t > 0 \tag{3}$$
The smoothed data $S_t$ is the result of smoothing the raw data $\{X_t\}$. The smoothing factor α determines the level of smoothing. The range of α is between 0 and 1 (0 ≤ α ≤ 1). When α is close to 1, the learning process is fast because the smoothing effect is weaker. In contrast, values of α closer to 0 have a more significant smoothing effect and are less responsive to recent changes (slow learning).
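The effect of α can be seen in a minimal sketch of single exponential smoothing. It assumes the standard recursion, in which the first smoothed value equals the first observation and each later value is α times the current observation plus (1 - α) times the previous smoothed value. The function name and the toy series are illustrative.

```python
def exponential_smoothing(series, alpha):
    """Single exponential smoothing of a list of numbers.

    Assumes the standard form: S_0 = X_0 and
    S_t = alpha * X_t + (1 - alpha) * S_{t-1} for t > 0.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if not series:
        return []
    smoothed = [series[0]]  # initial condition at t = 0
    for x in series[1:]:
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    toy = [10.0, 20.0, 30.0]  # toy values, not taken from the dataset
    print(exponential_smoothing(toy, 0.5))  # moderate smoothing
    print(exponential_smoothing(toy, 1.0))  # no smoothing: follows the data
    print(exponential_smoothing(toy, 0.0))  # full smoothing: stays at X_0
```

With α = 1 the output reproduces the input, and with α = 0 it never leaves the first observation, which are the two extremes of fast and slow learning described above.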
The value of α is not the same for every case. Therefore, we propose an optimum value of the smoothing factor based on the dataset characteristics. In this study, we use time-series data as in Fig. 3. Figure 3 shows the maximum ($X_{max}$), minimum ($X_{min}$), and average ($\frac{1}{n}\sum_{t=1}^{n} X_t$) value of the series. We have two considerations in defining the optimum α. The first is that the average value is less than the difference between $X_{max}$ and $X_{min}$. The second is that the optimum α must be less than 1. Equations (4), (5), and (6) show the optimum α formula.
Substituting Eq. (6) into Eq. (3) results in Eq. (7). We use the optimum smoothed result ($S_t$) to improve the CNN algorithm performance.

CNN with Lucas hidden layers
CNN is the main algorithm of this research. CNN has the capacity to learn meaningful features automatically from high-dimensional data. The input layer uses one feature since the model is univariate. A Flatten layer reshapes the output so that it can be fed to the fully connected layer. The fully connected layer consists of Dense layers, which provide the hidden layers.
Instead of using a random number, we used the Lucas number to determine the hidden layer. The Lucas number (Ln) is recursive in the same way as the Fibonacci sequence (Fn), with each term equal to the sum of the two preceding terms, yet with different initial values. This sequence was selected since it provides a golden ratio number. The golden ratio emerges in nature, demonstrating that this enchanted number is ideal for determining the optimal solution to numerous covering problems such as arts, engineering, and financial forecasting [38]. To date, several computer science problems do not have an optimal algorithm. Due to the lack of a better solution in these circumstances, approaches based on random or semi-random solutions are frequently used. Therefore, using the Lucas number is expected to provide an optimal result, in this case, to determine a hidden layer [39]. Figure 4 presents the sequence of Fibonacci and Lucas numbers. In this study, the Lucas number starts from three and ends with the last number before 100, which is 76. We limited the number of hidden layers to avoid the impact of time consumption and efficiency performance. Overall, we used 3, 4, 7, 11, 18, 29, 47, and 76 [40] for the numbers of hidden layers.
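The hidden layer counts listed above can be generated directly from the Lucas recurrence. The sketch below assumes the usual initial values L0 = 2 and L1 = 1; the function name and its bounds arguments are illustrative.

```python
def lucas_numbers(lower, upper):
    """Return the Lucas numbers L_n with lower <= L_n < upper, in order."""
    result = []
    current, following = 2, 1  # L_0 = 2, L_1 = 1 (standard initial values)
    while current < upper:
        if current >= lower:
            result.append(current)
        # Same recurrence as Fibonacci: each term is the sum of the previous two.
        current, following = following, current + following
    return result


if __name__ == "__main__":
    print(lucas_numbers(3, 100))
```

Starting from three and stopping before 100 gives the eight values 3, 4, 7, 11, 18, 29, 47, and 76 used in the experiments; the next Lucas number, 123, exceeds the limit.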
CNN has several parameters that differ by layer. The CNN forecast component parameters can be seen in Table 4. The parameter selection is based on previous research [41][42][43][44][45]. Dropout was used during the weight optimization at all layers to avoid overfitting [46]. Dropout is a weight optimization strategy that randomly picks a percentage of neurons at each training step and leaves them out. The dropout value used was 0.2 [47].
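To make the dropout mechanism concrete, here is a minimal sketch applied to a list of activations with a rate of 0.2. It assumes the common "inverted dropout" variant, in which the kept values are scaled by 1 / (1 - rate) during training so their expected sum is unchanged; the text does not state which variant was used, and the function name is illustrative.

```python
import random


def dropout(activations, rate=0.2, rng=None):
    """Zero each activation with probability `rate` and rescale the rest.

    Inverted dropout (an assumption): kept values are multiplied by
    1 / (1 - rate) so the expected total activation stays the same.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    out = dropout([1.0] * 10, rate=0.2, rng=rng)
    print(out)
    print("dropped:", sum(1 for v in out if v == 0.0), "of", len(out))
```

A new random mask is drawn on every call, so a different subset of neurons is left out at each training step; at inference time dropout is normally disabled.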

Performance testing
All experiments in this study were performed in the Python programming language on Google Colab, using the TensorFlow and Keras libraries, accessed from the Google Chrome browser. We used an Asus VivoBook X407UF laptop with a 7th generation Intel Core i3 processor, 1 TB hard drive, and 12 GB DDR3 RAM.
Table 4 The list of CNN forecast component parameters

Component             Parameter                             Value
Convolutional layer   Type of convolutional                 Conv1D
                      The number of convolutional layers    3
                      The number of filters                 128
                      The filter size                       2
                      The activation function               ReLU
Pooling layer         Type of pooling                       MaxPooling1D
                      The number of pooling layer           1
                      Size of the pooling window            2
                      The activation function output        ReLU
                      Type of optimizer                     Adam
                      The number of epochs                  10,000
                      The batch size

We used the mean square error (MSE) and the mean absolute percentage error (MAPE) as error evaluation measures. The MSE was employed to identify anomalies or outliers in the planned projection system [48]. MAPE, on the other hand, expresses the errors relative to the actual values and thus indicates accuracy [49]. The formulae are as follows [49].
$$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)^2$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$
$A_t$ is the actual data value, $F_t$ is the forecast value, and $n$ is the number of instances. The lower the MSE and MAPE values, the better the forecasting outcome and hence the better the approach utilized [50]. The architecture with the lowest MSE and MAPE values is therefore considered to have the best forecasting performance. We also recorded the training time of every scenario. The information is used as an additional performance indicator. We define the best algorithm as the method with the lowest time consumption.
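The two error measures can be sketched in a few lines, using the actual values, the forecast values, and the number of instances as defined above. This assumes the standard definitions, with MAPE expressed as a percentage (consistent with reported values around 9 to 10); the function names and the toy numbers are illustrative.

```python
def _check_lengths(actual, forecast):
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")


def mse(actual, forecast):
    """Mean square error: the average of squared differences."""
    _check_lengths(actual, forecast)
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumed convention).

    Undefined when an actual value is zero, because of the division.
    """
    _check_lengths(actual, forecast)
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    actual = [100.0, 200.0]    # toy values, not taken from the paper
    forecast = [110.0, 180.0]
    print("MSE :", mse(actual, forecast))
    print("MAPE:", mape(actual, forecast))
```

Because MSE squares each error, a few large misses dominate it, whereas MAPE weights every point by its relative error; this is why the two measures can rank architectures differently.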

Results
Tables 5 and 6 present the comparison of CNN and S-CNN in all scenarios. We used α = 0.57 as the smoothing factor of the S-CNN. The number of hidden layers varies from 3 to 76, following the Lucas numbers. These layers were used for all scenarios, including the baselines, MLP and LSTM.
Table 5 shows the MSE of CNN and S-CNN in all scenarios, including the CNN results for Scenario 1 with a composition of 70% training and 30% testing data. From Tables 5 and 6, the average MSE of CNN is 0.029351031, with an average processing time of 2013 s. The highest MSE, 0.039486471, occurs when the network has 3 hidden layers, with a processing time of 1401 s. The lowest MSE, 0.024530865, occurs with 18 hidden layers, with a processing time of 1810 s. The lowest MAPE (10.38339615) is obtained by the network with 3 hidden layers. Figure 4 shows the forecasting result of CNN within scenario 1.
Tables 5 and 7 show that the smoothed CNN (S-CNN) reaches its lowest MSE with 76 hidden layers. This architecture has an MSE of 0.020868076 and a processing time of 3410 s. The highest MSE, 0.036637854, occurs when the network has 3 hidden layers, with a processing time of 1221 s. This scenario yields an average MSE of 0.026531364 and an average processing time of 1878 s. For scenario 1, S-CNN with 4 layers produces the best MAPE of 9.45147180. Figure 5 shows the forecasting result of S-CNN within scenario 1.
Figure 5 shows the best forecasting results because it has the lowest MSE. Despite the lowest MSE, Fig. 5 shows a fairly significant gap at the beginning and middle of the period. Toward the end of the period, the forecasting results in Fig. 5 are similar to the original values. Fig. 6 shows a significant difference between the data and the forecasting results at the middle and the end of the forecasting period. Nevertheless, the gap between testing data and the result in Fig. 5 is more significant than the gap between testing data and forecasting in Fig. 6. Thus, Fig. 6 is the best architecture due to its low MSE.
Figure 7 compares the CNN with the smoothed one. In general, S-CNN is better than the original CNN in terms of MSE. Figure 7a shows that the MSE of S-CNN is lower than that of CNN, except with 47 hidden layers, where the MSE values of both are 0.026. The MSE values of the two methods begin to settle from 7 hidden layers up to the last value of 76, with an average MSE in that range of 0.026165752 for CNN and 0.023423422 for S-CNN. As Fig. 7b shows, the more hidden layers used, the longer the processing time required. With the initial three hidden layers, the processing time is the same for both, 1142 s. With the last value of 76 hidden layers, S-CNN is 111 s faster than CNN. Again, the S-CNN processing time is faster than that of CNN in every scenario.
Table 5 shows the CNN performance of Scenario 2 with 80% training and 20% testing data composition. The lowest MSE is 0.013227105 and the average MSE is 0.018452732. From Table 6, the average processing time is 2597 s, with individual times between 1901 and 3641 s. The best structure for scenario 2 uses 7 hidden layers, with MAPE = 9.29571771. Figure 8 presents the forecasting results of CNN within this scenario. Fig. 8 indicates a substantial disparity at the start and the middle of the period. However, as Fig. 8 indicates, the forecasting results approach the original values as the period ends.
Table 5 presents the MSE and processing time of the S-CNN with the various hidden layers. The lowest MSE, 0.012147693, occurs with 76 hidden layers. Nevertheless, in Table 6, its computation is the longest among the architectures, at 3351 s. The highest MSE, 0.023890096, occurs with 11 hidden layers.
In Table 7, the lowest MAPE is 9.49165793 for the S-CNN with 29 layers. Figure 9 shows the forecasting results of the S-CNN. Figures 8 and 9 show a considerable difference in the forecasting results. Figure 8 has a greater MSE than Fig. 9, which means that outliers occur more often with CNN than with S-CNN. In Fig. 8, the outliers occur in almost all periods. In Fig. 9, on the other hand, the outliers occur in the early and late periods. Therefore, it can be concluded that smoothing can improve performance by reducing the occurrence of outliers.
Figure 10 compares the CNN and S-CNN forecasting of Scenario 2 with the 80%:20% composition of the training and testing data. In general, Fig. 10a shows that the S-CNN has a lower MSE than its original version, which means that the CNN is less accurate than the S-CNN. On the other hand, the computation of both methods increases in line with the number of hidden layers. In terms of processing time, the smoothed CNN is faster than the original CNN in all scenarios, as seen in Fig. 10b.
We also compared our optimum α with α between 0.1 and 0.9 [17]. Table 8 shows the performance comparison using various values of the smoothing factor. The results show that the optimum α has the lowest average MSE and MAPE. Therefore, our proposed optimum α outperformed the other scenarios.
Table 9 shows the significance of using Lucas numbers as hidden layers on MSE, MAPE, and training time. The effect is considered significant when the two-tailed P value of the paired t-test is below 0.05. The result shows that Lucas numbers have a significant impact on MSE and training time. The insignificance shown in the MAPE results means that the Lucas number hidden layers cannot significantly improve the accuracy.
We also used a paired t-test to indicate the significance of α on MSE, MAPE, and training time. Since the P values in Table 10 are lower than 0.05, the use of α is significant for MSE, MAPE, and training time. In other words, using the smoothing factor is necessary to improve the forecasting performance.
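For readers unfamiliar with the paired t-test, the sketch below computes its test statistic for two matched lists of measurements, such as the MSE of two methods over the same set of architectures. It returns only the t statistic and the degrees of freedom; obtaining the two-tailed P value requires the Student t distribution, for which a statistics library such as SciPy (`scipy.stats.ttest_rel`) is typically used. The function name and the toy inputs are illustrative and are not values from the tables.

```python
import math


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / math.sqrt(variance / n)
    return t, n - 1


if __name__ == "__main__":
    # Toy paired measurements, purely illustrative.
    t, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
    print(f"t = {t:.4f}, degrees of freedom = {dof}")
```

The test works on the per-pair differences, so it is appropriate here because each architecture is evaluated under both conditions being compared.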
Table 9 Paired t-test result based on Lucas hidden layers
The proposed CNN is compared with other time-series forecasting methods using the same dataset, preprocessing process, and general parameter settings. This study uses MLP and LSTM as the baseline. The general parameter setting for MLP and LSTM is the same as the CNN setting in Table 4. Table 11 shows the forecasting comparison of all approaches. In all scenarios, the CNN method has lower MSE and MAPE results than MLP or LSTM. Therefore, forecasting using smoothed CNN (S-CNN) has better performance than the original CNN.
We used three more datasets to test the consistency of the best algorithm, S-CNN. The best scenario is used to test the datasets: scenario 2, 76 hidden layers, and a smoothing factor based on the statistical parameters of each dataset. The results of the evaluation using various types of datasets and different methods can be seen in Table 12, which presents the MSE and MAPE values of the S-CNN method on the different datasets. S-CNN outperformed the baseline in all datasets. The computation of S-CNN is more complex than that of the other methods in Table 12, as indicated by its training time, which is longer than that of LSTM and MLP. Due to the smoothing process, S-CNN is slightly faster than the original CNN. Therefore, the results of the performance test are consistent in every dataset.
Overall, the proposed use of the optimum smoothing factor in CNN (S-CNN) may improve the forecasting performance of CNN by reducing the MSE and MAPE. This study has two limitations. First, the proposed smoothing factor is limited because it is suitable for seasonal time-series data. Second, the efficiency of the proposed algorithm for multivariate time-series analysis should be considered, since multivariate data has different ranges, units, and dependencies.